Wikipedia:Bots/Requests for approval/Seppi333Bot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Seppi333 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 18:00, Wednesday, November 6, 2019 (UTC)
Function overview:
- Update 4 massive wikitables every few days (via a task scheduler) with any new/revised content from a database that is updated at the same frequency
Automatic, Supervised, or Manual:
- Automatic
Programming language(s):
- Python
Source code available:
Links to relevant discussions (where appropriate):
These sections include discussions about the bot itself (i.e., writing the tables using User:Seppi333Bot):
- WT:MCB/Archive_11#... Or just a resource for the gene symbols? (this section discussed the creation of the list pages and then the idea of using a bot to update them)
- WT:MCB#Wikipedia:Bots/Requests for approval/Seppi333Bot
These links pertain more to the wikitables generated by the Python script (whose bot function automates page writing via Seppi333Bot) than to the bot itself:
- Wikipedia talk:WikiProject Molecular Biology/Molecular and Cell Biology/Human protein-coding genes – the centralized talk page for the 4 list articles
- WT:MCB/Archive_11#Problematic links in the wikitables – these issues have all been fixed
- WT:WikiProject Disambiguation#If my understanding of WP:Disambiguation is correct, ~200 pages need to be converted to DABs – WikiProject Disambiguation is helping to fix this
Edit period(s):
- Once every 1–7 days
Estimated number of pages affected:
- Exactly 4 articles (listed below)
- One page in the project namespace (Wikipedia:WikiProject Molecular Biology/Molecular and Cell Biology/Human protein-coding genes)
Namespace(s):
- Mainspace
Exclusion compliant (Yes/No): Yes
Function details:
You can easily check the function of this bot on the Python code page; the functions were written in a modular fashion. It performs the following tasks in sequence (a minimal sketch follows the list):
- Download the complete set of human protein-coding genes from the HGNC database hosted on an ftp server and write it to a text file.
- Read from the text file and write 4 text files containing 4 wikitables, each with ~5000 rows of gene entries.
- Log in to Wikipedia on the User:Seppi333Bot account, open the 4 text files that were written to the drive, then:
- Open List of human protein-coding genes 1, replace the source code with the wikitext markup in the 1st text file, and save the page.
- Open List of human protein-coding genes 2, replace the source code with the wikitext markup in the 2nd text file, and save the page.
- Open List of human protein-coding genes 3, replace the source code with the wikitext markup in the 3rd text file, and save the page.
- Open List of human protein-coding genes 4, replace the source code with the wikitext markup in the 4th text file, and save the page.
- Delete all 5 text files from the drive.
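For illustration, a minimal Python sketch of the cycle above, assuming Pywikibot is configured for the bot account. The FTP URL, the TSV column names ("symbol", "name"), and the table markup are stand-ins rather than the actual script's values, and the sketch builds the tables in memory instead of writing 4 intermediate text files:

import csv
import os
import urllib.request

import pywikibot

# Hypothetical path to the HGNC dump; the real script's URL may differ.
FTP_URL = ("ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/"
           "locus_groups/protein-coding_gene.txt")
DATASET = "protein-coding_gene.txt"
PAGES = [f"List of human protein-coding genes {i}" for i in range(1, 5)]
ROWS_PER_PAGE = 5000

def download_dataset():
    # Step 1: a failed download raises an exception and halts the run.
    urllib.request.urlretrieve(FTP_URL, DATASET)

def build_tables():
    # Step 2: read the TSV and emit one wikitable per 5000 gene rows;
    # the column names and markup here are illustrative only.
    with open(DATASET, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    tables = []
    for start in range(0, len(rows), ROWS_PER_PAGE):
        lines = ['{| class="wikitable sortable"', "! Symbol !! Name"]
        for row in rows[start:start + ROWS_PER_PAGE]:
            lines.append("|-")
            lines.append(f"| [[{row['symbol']}]] || {row['name']}")
        lines.append("|}")
        tables.append("\n".join(lines))
    return tables

def run_bot():
    # Steps 3-4: log in, replace each list page's source, and save.
    site = pywikibot.Site("en", "wikipedia")
    site.login()
    for title, table in zip(PAGES, build_tables()):
        page = pywikibot.Page(site, title)
        page.text = table  # replaces the entire page source, as described
        page.save(summary="Bot: updating gene list from HGNC data")
    os.remove(DATASET)  # step 5: delete the downloaded file from the drive

if __name__ == "__main__":
    download_dataset()
    run_bot()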
Tentative additional function:
- Open Wikipedia:WikiProject Molecular Biology/Molecular and Cell Biology/Human protein-coding genes and update the 8 genes listed in this navbox using regex whenever the bot updates the lists.
Discussion
FWIW, I've been testing the runBot() function on these sandbox pages since yesterday morning, so I know it works exactly as intended. I just need approval to run this bot in the mainspace. The other functions in the script have been in operation since last week.
Also, does anyone know how I can extend the timeout duration when saving a page? The wikitables are massive, so publishing the edits takes a while, and I can't seem to find a timeout setting in the Pywikibot library. Seppi333 (Insert 2¢) 18:00, 6 November 2019 (UTC)[reply]
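(One possibility, noted here as an untested assumption rather than a confirmed fix, since the thread never resolves the question: Pywikibot reads an HTTP socket timeout from its config module, which a script can raise before saving very large pages:)

import pywikibot

# Assumption: raising the config-level socket timeout before save();
# the attribute exists in Pywikibot's config, but the value is illustrative.
pywikibot.config.socket_timeout = 300  # seconds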
- Don't know enough about Python to answer the timeout setting question, but what does happen when a human editor alters the tables in any way? Will their edits be overwritten? Jo-Jo Eumerus (talk) 18:05, 6 November 2019 (UTC)[reply]
- Well, since the bot blanks the page prior to writing content from the text files, yes. The way I look at these pages is similar to a protected template; you sort of need to request a change to the underlying source code in order for it to stick on the page itself. Seppi333 (Insert 2¢) 18:26, 6 November 2019 (UTC)[reply]
- @Jo-Jo Eumerus: I realize this is a bit premature given that I'm not going to run the bot in the mainspace without approval; but I've attempted to address the issue you mentioned by notifying other editors about the automatic page updates and where to make an edit request using these edit notices: Template:Editnotices/Page/List of human protein-coding genes 1 and Template:Editnotices/Page/List of human protein-coding genes 2. If you'd like to revise the wording or remove/change the image in the edit notices, please feel free to do so. I know some people find File:Blinking stop sign.gif a bit annoying given how attention-grabbing it is, but I chose it in this particular case to ensure that editors see the edit notice. Seppi333 (Insert 2¢) 02:15, 24 November 2019 (UTC)[reply]
- Well, we've had problems in the past with bots adding incorrect content and then re-adding it after human editors tried to fix it, that's why I asked. An edit notice seems like a good idea, perhaps it should also say where to ask about incorrect edits by the bot. Jo-Jo Eumerus (talk) 09:12, 25 November 2019 (UTC)[reply]
- @Jo-Jo Eumerus: Hmm. From an algorithmic standpoint, I'm almost positive that the only circumstance in which the bot could write incorrect/invalid content to the page is if the HGNC's protein-coding_gene.txt file is corrupt, since a download failure would raise an error and stop my script (I think this is true - switching to airplane mode mid-download raised an error and stopped the script; I'll test in a few minutes to be certain and reprogram the script to stop if I'm wrong). Should I add another line to the edit notice asking users to revert the page to the last correct version and contact an administrator to block the bot in the event that happens? Seppi333 (Insert 2¢) 21:52, 25 November 2019 (UTC)[reply]
- @Jo-Jo Eumerus: I realized that, because my bot is exclusion compliant, literally any editor can block my bot from editing those pages. Since it's faster for an editor to block my bot by editing the pages upon identifying a problem (which would need to be done anyway) than by contacting an administrator, I've opted to list that method in the edit notices (e.g., see Template:Editnotices/Page/List of human protein-coding genes 3). It would also save me time to get it up and running again, since I wouldn't have to appeal a block. If people start to abuse that block method, though, I'll have to make my bot non-exclusion compliant. Seppi333 (Insert 2¢) 10:36, 26 November 2019 (UTC)[reply]
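For reference, a minimal sketch of the exclusion mechanism described above, assuming Pywikibot's built-in {{bots}}/{{nobots}} check (Page.botMayEdit()); new_text is a placeholder for the table markup built earlier in the run:

import pywikibot

# Hedged illustration: botMayEdit() returns False when an editor has
# excluded this bot via {{nobots}} or {{bots|deny=...}} on the page.
new_text = '{| class="wikitable"\n|}'  # placeholder table markup
site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "List of human protein-coding genes 3")
if page.botMayEdit():
    page.text = new_text
    page.save(summary="Bot: updating gene list from HGNC data")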
- Note: This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT⚡ 18:24, 7 November 2019 (UTC)[reply]
- Seems to be a sandbox page. Seppi333: would it make sense for you to copy the content to mainspace using your main account, to test for objections to the content itself? –xenotalk 14:52, 16 November 2019 (UTC)[reply]
- That's my intent once I've ironed out the issues with the wikilinks mentioned here; I intend to fix those problems in the coming week and then move it to the mainspace. Seppi333 (Insert 2¢) 19:37, 16 November 2019 (UTC)[reply]
There are still a few mistargeted wikilinks (likely ~100 or so) in the table; I'm going to find and fix all of those soon after I write the natural language processing script that I mentioned in this section. I've identified all of them in User:Seppi333/GeneListNLP#mistargetedLinks.txt.
In any event, since these pages are now in the article space, approval of my bot would make it much easier for me to update those lists; it wouldn't hurt to wait a few days to see if anyone takes issue with any parts of the tables or the layout of the article though. I doubt anyone will object to the existence of the pages themselves since the complete set of human protein-coding genes constitutes the human exome and the human protein-coding genes section cites literature that discusses those genes as a set, so the existence of a list of them on Wikipedia is justifiable by the notability criteria in WP:Notability#Stand-alone lists. It's pretty self-evident that every gene in the table is notable as well given that virtually all ~20000 entries contain 2+ links to databases that provide information about the gene and encoded protein. Seppi333 (Insert 2¢) 21:16, 23 November 2019 (UTC)[reply]
- So the first page is about 875k pre-expansion of templates, which seems a bit large (didn't look at page 2). Onetwothreeip moved it to draft space citing the size and unspecified other reasons; perhaps they can comment here as to the latter, and you can think about splitting up the page into 4 or more pieces? –xenotalk 09:58, 24 November 2019 (UTC)[reply]
- The external links are completely unnecessary and should be removed. I tried to remove them myself but was unable to. Onetwothreeip (talk) 10:50, 24 November 2019 (UTC)[reply]
- @Xeno: The List of human genes article (it's technically a WP:Set index article due to the fact that the chromosome pages aren't list articles) breaks up the list by chromosome; I don't think that's a particularly useful way to break up a complete list of protein-coding genes, since the gene families (i.e., groups of gene symbols that share a common root symbol followed by a number; e.g., TAAR and HTR are root symbols for groups of genes that are covered in those articles) within the tables would be split up and spread across multiple list pages. This is because the genes within a given gene family are not necessarily located on the same chromosome (e.g., the 13 HTR genes are located on 10 different chromosomes). The proteins encoded by the genes within a given family share a common function; consequently, those gene groups constitute sub-lists of related entries in the wikitables, so splitting gene groups across two pages isn't ideal.
- With that in mind, I think splitting the list into 4 parts would be fine; I'd rather not split it any more than that, because it becomes progressively harder to navigate the list and increasingly likely that more gene groups will be split across two pages. I might end up replacing the locus group column with the gene location, pending feedback at WT:MCB. If I were to do that, the page size would shrink by about 50-100k more.
- @Onetwothreeip: (1) not notifying me when you draftified those pages was just rude; (2) I have no clue why you didn't just ask me about the page size; (3) I know that there are no fixed page size limits for list articles specified in policy/guideline pages, so it really irritates me that I had to look for the list of the largest pages in the mainspace instead of you linking to it directly to point out the page size of those lists relative to the others listed in Special:LongPages. Seppi333 (Insert 2¢) 13:14, 24 November 2019 (UTC)[reply]
- Is this bot approved for a trial? It appears to be editing in mainspace and projectspace without approval. Special:Contributions/Seppi333Bot. ST47 (talk) 23:54, 24 November 2019 (UTC)[reply]
- It isn't. I've blocked it for editing without approval. — JJMC89 (T·C) 00:12, 25 November 2019 (UTC)[reply]
- All of the edits by the bot were only performed in the project space; it didn’t occur to me until right now that the revision history after a page move would appear to show the bot editing in the mainspace though. I suppose I should’ve just copy/pasted the source code instead.
- In any event, the project space pages I was editing were - and still are - merely being used as sandboxes. It would've made no difference at all whether I created them in my user space or as a subpage of WP:MCB. I figured creating/editing a new page in the project space with a bot wouldn't be against policy, but apparently my interpretation wasn't correct. So, my bad. @JJMC89: If you're willing to unblock the bot, I'll confine its edits to my user space. I don't intend to perform any further edits with my bot if you unblock it, though. Seppi333 (Insert 2¢) 02:25, 25 November 2019 (UTC)[reply]
On an unrelated note, I’d appreciate input from a BAG editor on this request. I can update the tables in the article space with or without a bot. The only purpose it serves is for my convenience: I don’t want to have to regularly update the tables. Failing approval for my bot, I’ll probably just end up updating the tables on a monthly basis after a while. With approval, the tables would be updated daily since I’d just use a task scheduler to run the bot script every 24 hours.
Seppi333 (Insert 2¢) 02:48, 25 November 2019 (UTC)
There are at most 226 mistargeted links present in the current revisions of the 4 list pages; I could fix those right now, but I'd prefer to wait for a trial to use my bot to rewrite the relevant links as piped links due to the amount of time it takes to publish an edit in my browser window. Seppi333 (Insert 2¢) 10:36, 26 November 2019 (UTC)[reply]
- I'm going to fix this in the next 24 hours since I notified WT:WikiProject Disambiguation and they've rendered some assistance with disambiguating the links in that list. Seppi333 (Insert 2¢) 07:21, 27 November 2019 (UTC)[reply]
- Done I've fixed the corresponding links in all 4 articles. I have no further updates/changes planned for the tables in the algorithm; all the wikilinks are now correctly disambiguated and without targeting issues. If someone actually responds to this request at some point, please ping me. Seppi333 (Insert 2¢) 15:04, 27 November 2019 (UTC)[reply]
Feedback requested
- {{BAG assistance needed}}. A trial I think? –xenotalk 02:52, 25 November 2019 (UTC)[reply]
- xeno, I find it difficult to approve a trial for a bot run on pages that are of a questionable nature. In looking through the various discussions I'm seeing concerns about the necessity of the external links, the size of the pages, the need to update every day, and the necessity of the pages themselves. The bot has been (more or less) shown to run properly at its current remit, but if the scope of the page changes then the entire bot run would almost need to go back through a trial. At the very least I'd like to see a consensus about the format/layout/size of the article before approving this. Primefac (talk) 15:33, 8 December 2019 (UTC)[reply]
- @Primefac: You mentioned multiple issues here, so I figured I'd follow up on each individually. To summarize what I've stated below, I'm flexible on both the page size (which has since been reduced) and update frequency as well as how to format the references for each entry (NB: these are currently included as ELs). I'm not open to deleting the gene/protein references altogether due to the problem it would create w.r.t. the stand-alone list guideline (WP:CSC - 1st bullet), as explained below. I'm not sure that I understand your concern regarding the page scope.
- Re: "the need to update every day" - they really don't need to be updated every day. The database is updated daily, so I figured that would be the most natural frequency. I'm flexible on this point, so if you think the bot should update less frequently, anything between once per day and once per week seems fine. An update period of >1 week seems arbitrarily long IMO, given that there have been substantive changes to the database entries included in the wikitables several times a month since I created them; there's no way to predict when such changes occur.
- Re: "the size of the pages" - I'm actually somewhat flexible on this. Around 2 weeks after you replied here, Pigsonthewing raised this issue in the "Page size" thread. In response, I reduced the page size of each list by just over 100,000 bytes and mentioned how I might be able to reduce the page size even further. It's probably worth reading that thread for context.
- Re: "the necessity of the external links" - Onetwothreeip and I are at an impasse on this issue; we've been discussing it in the "Still problems with these articles" thread. He believes they serve no purpose and should be deleted instead of being converted to citations within reference tags. Since there are approximately 8000 redlinked gene symbols in the tables, and on the basis of the first bullet point under WP:Stand-alone lists#Common selection criteria, I included an external link for the gene (HGNC) and the encoded protein(s) (UniProt): they serve as an official link for the gene/protein (see the discussion thread for an explanation of what makes them "official") while simultaneously serving as a reference for the gene and protein. (NB: WP:ELLIST explicitly states: "In other cases, such as for lists of political candidates and software, a list may be formatted as a table, and appropriate external links can be displayed compactly within the table ... In some cases, these links may serve as both official links and as inline citations to primary sources.") The notability of the redlinked entries in these lists is not readily apparent, and whether those entries are relevant to the list is not easily verifiable w/o the HGNC and UniProt links. Many of the lists on the first page of Special:LongPages also employ this method to cite list entries, since an external link uses less markup (and hence reduces the page size) relative to adding reference tags for each reference. For context, <ref></ref> is exactly 11 characters long; if I placed 1 set of reference tags around the HGNC and UniProt ELs (w/o any additional reference formatting) for all 5000 gene entries in each wikitable, it would add 11×2×5000 = 110,000 bytes to the size of each page.
- I object to your marginalisation of opposition to this. I haven't seen many articles that are blue on your list that really are about the respective gene. Practically all such articles really are about the respective protein encoded by the gene (even if the article begins "XYZ is a gene"). So this list is a fake from the start. The reason is, of course, that genes have no function per se except that of being the object of transcription: it's all in the proteins. --SCIdude (talk) 07:05, 19 December 2019 (UTC)[reply]
- Re: "I object to your marginalisation of opposition to this" - what are you referring to, specifically? I don't really follow your argument. No articles about a protein-coding gene are about just the gene or the encoded protein; they're about both. There are a very limited number of circumstances where it's appropriate to topically separate a gene and an encoded protein across 2+ articles (e.g., it might be prudent to do so when a gene encodes multiple proteins), but even in such cases, the article scope still encompasses the encoded proteins in the gene article and the gene which encodes the protein in the protein articles. The way the bluelinked articles are supposed to be written is covered in MOS:MCB#Sections. The majority of those sections relate to the protein because the encoded protein is the mechanism through which this class of genes affects the organism. Seppi333 (Insert 2¢) 11:07, 19 December 2019 (UTC)[reply]
- So, at least, the pages are misnamed; they are not "lists of human protein-coding genes". --SCIdude (talk) 14:09, 19 December 2019 (UTC)[reply]
- Re: "if the scope of the page changes then the entire bot run would almost need to go back through a trial" - I'm not certain I fully understand what you meant by this. The only way the list's scope could change is if I were to change the underlying dataset (i.e., "protein-coding_gene.txt") used by my algorithm to a different one; that would entail a complete rewrite of my algorithm (i.e., the wikilink dictionary would need to be deleted or replaced since it'd be moot, the function that generates the tables would have to be entirely rewritten to reflect the scope and structure of the new dataset, and the dataset download function would require at least partial revision). I do not intend to, and wouldn't even consider, doing something like that. If you meant something else, though, please clarify.
- I'm not sure how I might be able to obtain consensus on these issues since most people don't really care about lists like this; I'll try asking for feedback about the page size of the lists and the external links at WT:MCB and WT:WikiProject Molecular Biology to see if anyone is willing to offer some though. If you have any advice on how I might establish a consensus or have any feedback about the lists, I'd appreciate your input. Seppi333 (Insert 2¢) 05:07, 19 December 2019 (UTC)[reply]
- Given that all the articles on Special:LongPages are too long, it's not a good idea to use them as an example of what should be done. There is simply no reason why each gene should be referenced, let alone given an external link. Onetwothreeip (talk) 10:01, 19 December 2019 (UTC)[reply]
- While I agree that examples don't make for a very good argument, these lists are also very long; the 20000 entries in these list pages probably constitutes one of the longest lists on WP by number of list entries. While the inherent notability of a protein-coding gene may be obvious to some, I'm virtually certain that some editors who aren't privy to this discussion will object to the inclusion of all the redlinks at some point, justifying their argument with Wikipedia:What Wikipedia is not#Wikipedia is not an indiscriminate collection of information, if there are no references provided for them. Given the sheer size of this list (due to its completeness) and the large number of redlinks in it, I think it's necessary to include at least 1 of the current ELs to avoid future objections pertaining to WP:CSC. It's worth pointing out that the sole templated reference that's currently included on these pages merely links to a webpage with download links for machine-readable text/json files that are virtually unreadable to a person (e.g., this is the rather incomprehensible text file that the algorithm uses to generate the tables). TBH, I wouldn't really care about cutting these links if doing so didn't create a different problem. There may be a solution that would address both concerns, but I can't think of one at present. Seppi333 (Insert 2¢) 11:07, 19 December 2019 (UTC)[reply]
- I'm concerned not only about the size of these pages, but also the "indiscriminate" aspect. What is the use case for them? Who will use them? What do they offer that a category would not? Or that Wikidata does not? If a significant part of their function is to provide red links for others to work on, then perhaps a Wikipedia-space project page would be appropriate, where the list(s) could be built by WP:Listeria (as indeed they could in mainspace, at least on more enlightened Wikipedias)? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 12:14, 19 December 2019 (UTC)[reply]
- Sigh. I'm not sure it's technically correct to use the term "indiscriminate" in this case because the list is literally complete - there are no missing genes in this list which are known to encode a protein - so the list entry selection criteria can't possibly be random. This list was created as a side project when I needed a complete list of protein-coding genes for training a speech-to-text AI, so my own use case inspired the creation of the lists. See here, namely the collapsed section. There are undoubtedly other use cases. HGNC doesn't provide an easily accessible and complete, human-readable list, so I put one here while I was working on my transcription AI. For context on what the entire list represents, search for the word "exome" on this page and read what I wrote previously. FWIW, I imagine there are more use cases related to whole exome sequencing than training an AI, since that list represents the whole exome (at least, the vast majority of it; there are likely a few hundred protein-coding genes - which I imagine are mostly smORFs - that have yet to be identified, hence why I want a bot to update the list).
- Addendum: Like I said before, I'm open to splitting the list into 10 pages, but that would nearly triple the amount of time it takes me to update the lists. It already takes me several minutes to do it with the current 4, so I'm not going to split the list further unless either (1) someone offers to help me perform manual updates on 10 pages or (2) this bot is approved to do it for me. The lists have actually been out of date w.r.t. the HGNC database for several days since the gene counts differ, but I haven't updated it yet because performing an update is both tedious and an annoying time sink even with 4 pages. You can probably imagine how motivated I'd be to regularly perform manual updates on 10 pages. Seppi333 (Insert 2¢) 13:04, 19 December 2019 (UTC)[reply]
- @Pigsonthewing: 2nd addendum − I stumbled across Wikipedia:Manual of Style/Tables#Size, which links to WP:SPLITLIST; both advocate against splitting lists, and tables in particular, at arbitrary cutpoints. I don't think splitting the list into 4 tables was arbitrary since the pages were by far the largest pages in the mainspace prior to the split into 4 distinct pages. However, splitting the tables across 10 pages doesn't seem like it would be compliant with the MOS guideline on article size (WP:SPLITLIST) or tables (Wikipedia:Manual of Style/Tables#Size) based upon what those sections indicate. There are two final means I can think of that would allow for further reduction of the page size across all 4 pages, without loss of content or context about what the list entries reflect.
- The first is to create a Template:UniP redirect to Template:Uniprot and change all of the uniprot templates in the tables to the redirect. This would reduce the bytes per entry by 3, so each of the first 3 pages would shrink by 3×5000 = 15,000 bytes without any visible change to the lists. The second would be to remove the status column AND all of the gene entries with a status of "Entry Withdrawn", which are the last 100 entries (indices 19201–19300) in List of human protein-coding genes 4, so that the lists only contain approved gene symbols. I would then need to indicate somewhere on these pages that all the gene symbols in these lists are the approved symbols for their corresponding genes. For context, "Entry withdrawn" has a very specific meaning in the HGNC database: it indicates "a previously approved HGNC symbol for a gene that has since been shown not to exist." Cutting that column would reduce the page size of the first 3 pages by 10 bytes per row ⇒ 10×5000 = 50,000 bytes per page. If both of these changes sound fine to you and no one has any objections, I can go ahead and reduce the page size of the first 3 pages by 65,000 bytes, but that seems to be as far as I can reduce it simply by restructuring the tables in a manner that doesn't sacrifice content. Let me know what you think when you get a chance. Seppi333 (Insert 2¢) 05:14, 25 December 2019 (UTC)[reply]
- Actually it's not easy to do this in Wikidata, because of the conflated content of these articles and the multitude of concept types they are linked from (gene / protein / protein family). With this query I get 5,911 missing articles:
SELECT DISTINCT ?gene ?geneLabel
{
?gene wdt:P31 wd:Q7187 .
?gene wdt:P703 wd:Q15978631 .
?gene wdt:P688 ?protein .
?protein wdt:P361 ?family .
MINUS {
?article schema:about ?gene ;
schema:isPartOf <https://en.wikipedia.org/> .
}
MINUS {
?article schema:about ?protein ;
schema:isPartOf <https://en.wikipedia.org/> .
}
MINUS {
?article schema:about ?family ;
schema:isPartOf <https://en.wikipedia.org/> .
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
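(For readers without a SPARQL background, a hedged Python sketch of running such a query against the public Wikidata Query Service endpoint; this trimmed variant keeps only the first MINUS block from the query above and adds a LIMIT, so it is illustrative rather than SCIdude's exact query:)

import requests

QUERY = """
SELECT DISTINCT ?gene ?geneLabel WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 .   # found in taxon: Homo sapiens
  MINUS {
    ?article schema:about ?gene ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
LIMIT 10
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "gene-list-example/0.1 (demonstration only)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["geneLabel"]["value"])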
- So why are pages needed? --SCIdude (talk) 14:27, 19 December 2019 (UTC)[reply]
- @SCIdude: That's sort of an odd question. I don't think any article on WP is "needed"; it's really just a question of whether an article meets the notability criteria.
- My question was (cryptic): why are these list pages needed if you can get the full list with such a query? --SCIdude (talk) 15:10, 19 December 2019 (UTC)[reply]
- Well, I suppose there are 2 reasons. The first is that an article is readily accessible; a query requires a working knowledge of SPARQL or a pre-written script to run. The second, which I think is more important, is that the WP lists would remain fully up-to-date with the HGNC database, whereas WD is updated by PBB, which pulls gene data from NCBI gene, which in turn pulls its approval/nomenclature data from HGNC. So basically, there's no intermediate database that might delay data currency with HGNC (provided that a bot updates it). Seppi333 (Insert 2¢) 00:25, 20 December 2019 (UTC)[reply]
- Also, if there are ~6000 missing articles in Wikidata and ~8000 redlinks in the tables, then there are roughly 2000 gene symbols that lack a redirect to an existing article on the gene/protein (they're typically categorized with {{R from gene symbol}}). Do you know of any simple methods to determine which gene symbols those articles correspond to using WD's data? Seppi333 (Insert 2¢) 15:01, 19 December 2019 (UTC)[reply]
- To clarify, the query lists those gene items that 1. don't have an enwiki article, 2. where the encoded protein(s) don't have an enwiki article, and 3. where the associated families don't have an enwiki article. So it is exactly what your list pages do. --SCIdude (talk) 15:10, 19 December 2019 (UTC)[reply]
- Also there are no missing items in Wikidata; we have all human genes and proteins, and our InterPro family import goes up to IPR040000, so it is quite recent. That's why this query does what I said above. --SCIdude (talk) 15:14, 19 December 2019 (UTC)[reply]
- I understood what you meant; my point is that there are more sitelinks to gene/protein articles in wikidata than there are bluelinks in the gene lists. The only reason that would happen is if there are articles on genes/proteins which are not located at the pagename of the corresponding gene symbol and which lack a redirect from that gene symbol. I was hoping you knew of a way to identify which redlinked gene symbols need to be redirected to an existing gene/protein article. Seppi333 (Insert 2¢) 15:21, 19 December 2019 (UTC)[reply]
- WD usually has no item that links to a redirect, so this can't be done from WD alone. But I have put a useful query on your talk page. Also I just see that the above query counted articles on protein families multiple times because families are associated with every member of the family. The rewritten query would remove the third MINUS block and list all gene items that 1. don't have an enwiki article and 2. where the encoded protein(s) don't have an enwiki article. The number is now 6,078. --SCIdude (talk) 17:24, 19 December 2019 (UTC)[reply]
@Boghog: You asked me about use cases for the list pages in Wikipedia:Bots/Requests for approval/Seppi333Bot 2#Discussion (Pigsonthewing asked about this above as well); since I just learned that amphetamine activates 2 human carbonic anhydrase isoforms from its recently updated IUPHAR/BPS entry and subsequently learned, following a literature search, that it activates 7 human carbonic anhydrase isoforms per [1], I needed to know how many carbonic anhydrases there are in humans. So, I went to two places: carbonic anhydrase and list of human protein-coding genes 1, then ctrl-F searched for CA1 on the second page. Based upon the prose and the table in Carbonic anhydrase#Families, it's not apparent that CA8, CA10, and CA11 are human genes/proteins because no context is provided, nor is it readily apparent that CA15 is a human pseudogene unless you actually read the tissue distribution column in the table of mammalian genes. It was simple to determine how many carbonic anhydrase proteins there are in humans from the human gene list and, using the UniProt links, I was able to access more up-to-date information than is present in the carbonic anhydrase article; e.g., the carbonic anhydrase article says CA8's function is unknown and (prior to my impending revision) CA8 didn't state much more, but UniProt states that CA8 has undergone catalytic silencing due to the Arg-116 residue replacing a His at that location, that a rare form of cerebellar ataxia is caused by autosomal recessive mutations in the CA8 gene, and (in an expression link) that it has high cerebellar expression in humans. Hence, CA8 is clearly a functional human protein despite lacking catalytic activity. I probably would've ignored CA8, CA10, and CA11 were it not for the gene lists.
In the future, I'm likely going to use the human gene lists in tandem with articles on gene families I'm unfamiliar with like this to fact-check them, as I don't know whether or not the article lists a human pseudogene (e.g., like TAAR1 did before I started editing it) as a functional protein-coding gene or contains outdated information. Re wikidata:Topic:Vdkct4s72gw7182g: I also can't rely on a linked gene article's infobox for access to that information. Seppi333 (Insert 2¢) 03:45, 28 December 2019 (UTC)[reply]
- @Seppi333: There are other places one can easily find this information. For example:
- "Carbonic anhydrase". HUGO Gene Nomenclature Committee.
- "Gene group: Carbonic anhydrases". HUGO Gene Nomenclature Committee.
- "EC 4.2.1.1". SIB Swiss Institute of Bioinformatics. (ctrl-F human)
- "name:"carbonic anhydrase" AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]"". UniProt.
- TAAR1 is protein-coding and not a pseudogene. Re: "I also can't rely on a linked gene article's infobox for access to that information." - Yes you can. The external links in the TAAR1 infobox, for example, are up-to-date and accurate. {{Infobox gene}} pulls its data from Wikidata, which is a much better mechanism for storing this type of data compared to lists. Boghog (talk) 22:06, 28 December 2019 (UTC)[reply]
- @Boghog: I meant to write TAAR in my post - I was referring to TAAR3 being listed as a human protein-coding gene in the linked revision. My phone is almost dead, but as long as the infobox is actually populated, I agree. It isn't always, as I pointed out in my link (the entire family of TCR protein subunits is missing almost all gene infobox-linked data items on WD). Anyway, I prefer to use UniProt over HGNC for protein data. I agree that Expasy is a good source for enzymes. Addendum: Expasy doesn't include the 3 acatalytic yet still functional hCA proteins. Seppi333 (Insert 2¢) 22:43, 28 December 2019 (UTC)[reply]
- The problem with TAAR3 was not the infobox/Wikidata, but the (semi)manually written lead sentence that did not anticipate special cases such as pseudogenes. The Wikidata from the very beginning correctly identified TAAR3 as a pseudogene. The problem with TRA (gene) is that the entire T-cell receptor alpha locus Wikidata item is missing. I am not sure why that is. This may be another special case (i.e., gene locus). I am sure that there are other mistakes and omissions in Gene Wikidata, but these are relatively rare and can/should be fixed. I am still not convinced that the gene lists are necessary. We already have a mechanism for storing this type of information in Wikidata. Boghog (talk) 09:06, 29 December 2019 (UTC)[reply]
@Boghog: Hmm. The issue wasn't with TAAR3 or just TRA, so let me clarify: Special:permalink/593700991#Animal TAAR complement, "Human — 7 genes (TAAR1, TAAR2, TAAR3, TAAR5, TAAR6, TAAR8, TAAR9) and 2 pseudogenes (TAAR4P, TAAR7P)". As for the TCR proteins: TRA, TRB, TRD, and TRG (i.e., an entire gene/protein family) are all missing virtually all the data that would normally supply their infoboxes (provided their articles existed), save for the aliases and the entrez link. Now, don't get me wrong, I completely agree with you that this data is ideally suited for Wikidata, and if the choice were mutually exclusive, I'd pick Wikidata over WP as the place to put all gene/protein data. These aren't mutually exclusive alternatives, though, and aggregating a large amount of data from Wikidata, generating one's own structured database on the scale of these gene lists, or even just accessing (nothing more) a complete list of protein-coding genes is not something one can do unless they're at least a mildly competent programmer. I don't really see how creating a duplicate set of data on WP in these lists poses a problem for anyone, so it sort of confuses me as to why you're opposed to the idea of their existence. I don't think the lists are necessary either; that doesn't mean I'm not interested in working on them. Just to be clear, are you advocating that the lists should be deleted? Seppi333 (Insert 2¢) 14:54, 29 December 2019 (UTC)[reply]
- I am not advocating deleting the list (I am generally an inclusionist). Furthermore, I do not necessarily see redundancy as a problem, but I would prefer fixing existing solutions before inventing new ones. I am still not getting what the issue is with the TAAR gene family. The Gene Wikidata is correct. Some of the associated information in Wikipedia main space is incorrect, but this is not the fault of the infoboxes or Wikidata. In fact, the main space information can be entirely corrected based on information stored in Wikidata and displayed in the infoboxes. TRA, TRB, TRD, and TRG are all gene loci, and none of these are listed in the Ensembl database. Ensembl in turn was critical in supplying information to Wikidata. Therefore, it appears that gene loci are special cases that will need to be handled separately. Boghog (talk) 19:10, 29 December 2019 (UTC)[reply]
- I finished updating the last set of dictionary links in the python code last week, so these lists won’t require any editor attention in the future unless my bot isn’t approved for this task. That said, if I hadn’t created the lists, I would not have worked with WT:WPDAB, and Certes in particular, to systematically disambiguate the entire GeneWiki (the mistargeted links to non-gene-related pages in the tables were corrected in the process). That work resulted in the addition of hundreds of DAB hatnotes or DABpage conversions at page titles for gene symbols which were about a non-gene-related topic. I also wouldn’t have proposed my 2nd bot task since I wouldn’t have been able to determine that there are ~2000 gene/protein pages that aren’t linked from their gene symbols.
- So, creating these lists incidentally allowed me to identify and fix some rather significant issues that would have remained unresolved otherwise. Seppi333 (Insert 2¢) 00:56, 30 December 2019 (UTC)[reply]
@Pigsonthewing: I'm not sure if you care about this anymore since you didn't reply to my earlier ping or my post on your talk page, but I figured I'd let you know that I went ahead and reduced the page size by 50k by removing the Status column. These pages are now more-or-less as small as they're going to get w/o splitting them (~280k bytes), which seems fine to me considering that they're only about 40k larger than one of the FA-class articles I maintain. Anyway, feel free to ignore this message if you're still uninterested.
" These pages are now more-or-less as small as they're going to get (~280k bytes)"
Then they're still far too big, and need to be spit several times (unless they're deleted in favour of using Wikidata). as for not replying to your questions; I'm still waiting for replies to mine: What is the use case for them? Who will use them? What do they offer that a category would not? Or that Wikidata does not? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 11:13, 2 January 2020 (UTC)[reply]- How am I supposed to answer that? Other than simply learning about a topic (different groups of genes), what other purpose does an article serve? I can't even answer those question about the FAs I've spent countless hours working on beyond how I would use them, which I have answered. WD doesn't provide lists and the gene category, even if actually fully-populated with all our gene articles, is still 8000 shy of the number included in these lists. The benefit of a list in this case is rather obvious. Seppi333 (Insert 2¢) 11:24, 2 January 2020 (UTC)[reply]
- I find it far from obvious; so I've raised a deletion discussion to see whether the wider community considers them useful: Wikipedia:Articles for deletion/List of human protein-coding genes 1. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 12:42, 2 January 2020 (UTC)[reply]
- I actually appreciate you doing that since it's probably the fastest way to generate consensus about these pages. Seppi333 (Insert 2¢) 13:13, 2 January 2020 (UTC)[reply]
Break
@Primefac: I think I've discussed this ad nauseam with everyone. Each list page is over 150,000 bytes smaller than it was originally, and I can't reduce the size any further without additional page splits. I finished updating the link dictionary last week, so the source code is pretty much finalized right now. The only unresolved issue is the links, which I discussed with Onetwothreeip above; I don't really see a way to address both concerns simultaneously. Seppi333 (Insert 2¢) 07:21, 2 January 2020 (UTC)[reply]
- It does, and the AFD will likely answer (or at least shed a little light on) the remaining concerns I've seen mentioned. I've watchlisted the AFD and will make a final decision after it is closed (unless another BAG wants to hop in and take over). Primefac (talk) 15:22, 2 January 2020 (UTC)[reply]
- The AfD went more-or-less how I expected. Ironically, several people suggested merging the lists, but it might not be a good idea to do that given the opposing view. I went ahead and added the gene ranges to the navbox instead of renaming the pages to a letter-based index as suggested in the discussion; it might be a good idea to prepend the word "page" before the number in these pages (e.g., "List of human protein-coding genes page 1/2/3/4") anyway, though, since 1 person found the title confusing. I'll implement that as long as you're fine with me renaming/moving the pages. That was all the actionable feedback for me from that page.
- I think basically all of the points discussed on this page were directly addressed in the AfD, except for the ELs. One of the reasons I advocated retaining the ELs in the lists, as mentioned in the discussion above, is to ensure that the notability of the list entries is easily verifiable if not self-evident. It would have been difficult, if not impossible, to explain and show without any ambiguity that there exists citable literature and data/info-base entries for every single one of them (i.e., why they're all notable genes) in the AfD discussion if I had removed those ELs beforehand. Seppi333 (Insert 2¢) 04:09, 18 January 2020 (UTC)[reply]
- Also, if you're fine with me modifying the source code one last time, I could program the bot to automatically update the 8 genes that serve as list cutoff indices in the navbox for these pages (transcluded below) at the same time that it updates the lists. Those cutoffs seem to change about once every 2-4 weeks when HGNC updates their data. It would be very easy to program that bot function since I'd just need to slightly expand the loop to record the names of the genes corresponding to indices 1, 5000, 5001, 10000, 10001, 15000, 15001, and the last index, then use regex and the list of genes for those indices to rewrite the navbox text (see the sketch after the navbox below). Seppi333 (Insert 2¢) 04:47, 18 January 2020 (UTC)[reply]
Human protein-coding gene pages:
- Python code for maintaining the list
- List of human protein-coding genes page 1 covers genes A1BG–EPGN
- List of human protein-coding genes page 2 covers genes EPHA1–MTMR3
- List of human protein-coding genes page 3 covers genes MTMR4–SLC17A7
- List of human protein-coding genes page 4 covers genes SLC17A8–ZZZ3
NB: Each list page contains 5000 human protein-coding genes, sorted alphanumerically by the HGNC-approved gene symbol. Follow the Python code link for information about updates to the list of genes on these pages.
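A hedged sketch of that navbox-update function: the regex targets the "page N covers genes X–Y" line format shown in the navbox above, while the function name and cutoff handling are illustrative rather than the script's actual code:

import re

def update_navbox(navbox_text, symbols):
    # symbols: all gene symbols in alphanumeric order; record the cutoff
    # genes at (1-based) indices 1, 5000, 5001, 10000, ..., and the last.
    cutoffs = [symbols[0], symbols[4999], symbols[5000], symbols[9999],
               symbols[10000], symbols[14999], symbols[15000], symbols[-1]]
    # Rewrite each "page N covers genes X–Y" line with the new cutoffs.
    for page, (first, last) in enumerate(zip(cutoffs[0::2], cutoffs[1::2]), 1):
        navbox_text = re.sub(
            rf"(page {page} covers genes )\S+–\S+",
            rf"\g<1>{first}–{last}",
            navbox_text)
    return navbox_text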
Approved. For clarity:
- The bot is approved to update the related pages a maximum of once per week
- If a consensus determines that the pages should be split further, or merged together, the bot will still be able to update the pages.
If there are further concerns about the bot or this task, they should be raised at this talk page and/or BOTN depending on the issue and severity. Primefac (talk) 12:27, 18 January 2020 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.